INTERSPEECH.2023 - Speech Recognition

Total: 259

#1 DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

Authors: Yifan Peng ; Yui Sudo ; Shakeel Muhammad ; Shinji Watanabe

Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but the large model size and heavy computational cost hinder their deployment. Knowledge distillation trains a small student model to mimic the behavior of a large teacher model. However, the student architecture usually needs to be manually designed and remains fixed during training, which requires prior knowledge and can lead to suboptimal performance. Inspired by the recent success of task-specific structured pruning, we propose DPHuBERT, a novel task-agnostic compression method for speech SSL based on joint distillation and pruning. Experiments on SUPERB show that DPHuBERT outperforms pure distillation methods in almost all tasks. Moreover, DPHuBERT requires little training time and performs well with limited training data, making it suitable for resource-constrained applications. Our method can also be applied to various speech SSL models. Our code and models will be publicly available.
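
A minimal PyTorch sketch of a joint distillation-and-pruning objective is given below for intuition; it is not the authors' implementation, and the gate parameterization, loss terms, and target sparsity are assumptions.

```python
# Sketch: the student mimics teacher features while learnable per-channel gates
# are pushed toward a target sparsity (simplified stand-in for structured pruning).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLinear(nn.Module):
    """Linear layer whose output channels can be pruned via learnable gates."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.gate_logits = nn.Parameter(torch.zeros(d_out))  # one gate per output channel

    def forward(self, x):
        return self.linear(x) * torch.sigmoid(self.gate_logits)

    def kept_ratio(self):
        return torch.sigmoid(self.gate_logits).mean()  # expected fraction of kept channels

def distill_prune_loss(student_feats, teacher_feats, gated_layers,
                       target_sparsity=0.5, lam=1.0):
    # Feature distillation: L1 + cosine distance to the frozen teacher's features.
    l1 = (student_feats - teacher_feats).abs().mean()
    cos = 1 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()
    # Sparsity regularizer: drive the average kept ratio toward the target.
    kept = torch.stack([g.kept_ratio() for g in gated_layers]).mean()
    return l1 + cos + lam * (kept - target_sparsity).abs()
```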

#2 Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations

Authors: Salah Zaiem ; Titouan Parcollet ; Slim Essid

Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models even with small annotated datasets. Despite this, speech SSL representations may fail when facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method, designed for cases exhibiting such a mismatch in acoustic domains. It consists of applying properly calibrated data augmentations to a large clean dataset, bringing it closer to the target domain, and using it as part of an initial fine-tuning stage. Augmentations are automatically selected through the minimization of a conditional-dependence estimator, based on the target dataset. The approach is validated in an oracle experiment with controlled distortions and on two amateur-collected low-resource domains, achieving better performance than the baselines in both cases.
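
For intuition, the selection loop might look like the sketch below, where plain HSIC stands in for the paper's conditional-dependence estimator; the feature and nuisance matrices, candidate augmentations, and kernel width are placeholders rather than the authors' choices.

```python
# Rank candidate augmentations by an estimated dependence score and keep the minimizers.
import numpy as np

def rbf_kernel(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, Y):
    # Hilbert-Schmidt Independence Criterion between paired feature matrices.
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_kernel(X) @ H @ rbf_kernel(Y) @ H) / (n - 1) ** 2

def select_augmentations(features, nuisance, candidates, k=2):
    # features: (n, d) clean-speech representations; nuisance: (n, m) paired
    # quantities the representations should not depend on (a placeholder for the
    # conditioning variables used in the paper).
    scores = {name: hsic(aug(features), nuisance) for name, aug in candidates.items()}
    return sorted(scores, key=scores.get)[:k]   # keep the lowest-dependence augmentations

rng = np.random.default_rng(0)
feats, nuis = rng.normal(size=(64, 16)), rng.normal(size=(64, 4))
cands = {"noise": lambda f: f + 0.1 * rng.normal(size=f.shape), "scale": lambda f: 0.8 * f}
print(select_augmentations(feats, nuis, cands, k=1))
```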

#3 Dual Acoustic Linguistic Self-supervised Representation Learning for Cross-Domain Speech Recognition

Authors: Zhao Yang ; Dianwen Ng ; Chong Zhang ; Xiao Fu ; Rui Jiang ; Wei Xi ; Yukun Ma ; Chongjia Ni ; Eng Siong Chng ; Bin Ma ; Jizhong Zhao

The integration of well-pre-trained acoustic and linguistic representations boosts the performance of speech-to-text cross-modality tasks. However, the potential of fine-tuning such cross-modality integrated models on accented and noisy corpora is still under-explored. To address this gap, we propose an end-to-end acoustic and linguistic integrated representation learning model, namely Dual-w2v-BART. Our model incorporates acoustic representations from wav2vec2.0 and linguistic information from BART by utilizing the cross-attention mechanism in the decoder, with paired speech-text dual inputs. To enhance robustness to accents and noise, we propose a text-centric representation consistency component that increases the similarity between inputs from different modalities that represent the same content. Results on accented and noisy speech recognition tasks demonstrate the effectiveness of the proposed model in reducing error rates compared to baseline and other competitive models.
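
A minimal sketch of what a text-centric consistency term could look like (the exact formulation in the paper may differ; treating the text branch as the anchor is an assumption here):

```python
# Pull the speech-conditioned decoder states toward the text-conditioned ones for
# the same content; the text branch is detached so it acts as the anchor.
import torch
import torch.nn.functional as F

def text_centric_consistency(speech_repr, text_repr):
    # speech_repr, text_repr: (batch, time, dim) decoder states for paired inputs
    return 1 - F.cosine_similarity(speech_repr, text_repr.detach(), dim=-1).mean()
```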

#4 O-1: Self-training with Oracle and 1-best Hypothesis

Authors: Murali Karthick Baskar ; Andrew Rosenberg ; Bhuvana Ramabhadran ; Kartik Audhkhasi

We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR) that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach on the publicly available SpeechStew datasets and a large-scale in-house dataset. On SpeechStew, the O-1 objective closes the gap between actual and oracle performance by 80% relative, compared to EMBR, which bridges the gap by 43% relative. O-1 achieves 13% to 25% relative improvement over EMBR on the various datasets that SpeechStew comprises, and a 12% relative gap reduction with respect to the oracle WER compared to EMBR training on the in-house dataset. Overall, O-1 results in a 9% relative improvement in WER over EMBR, speaking to the scalability of the proposed objective for large-scale datasets.

#5 MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets

Authors: Ziyang Ma ; Zhisheng Zheng ; Changli Tang ; Yujin Wang ; Xie Chen

In this paper, we provide a new perspective on self-supervised speech models based on how the training targets are obtained. We generalize the targets extractor into an Offline Targets Extractor (Off-TE) and an Online Targets Extractor (On-TE). Based on this, we propose MT4SSL, a new multi-task learning framework for self-supervised learning, which stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL uses the K-means algorithm as an Off-TE and a gradient-free teacher network as an On-TE. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models with less data. Furthermore, we find that using both Off-TE and On-TE leads to better convergence in the pre-training phase. Given both its effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models from our perspective is a promising direction.
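
Schematically, the two target types can be combined as below (not the authors' code; the loss weights, the regression loss, and the masking details are assumptions):

```python
import torch.nn.functional as F

def mt4ssl_style_loss(logits, kmeans_ids, student_feats, teacher_feats, mask, alpha=1.0):
    # logits:        (batch, time, num_clusters) predictions over K-means units (Off-TE)
    # kmeans_ids:    (batch, time) offline cluster assignments
    # student_feats: (batch, time, dim) student features
    # teacher_feats: (batch, time, dim) EMA-teacher features (On-TE targets)
    # mask:          (batch, time) boolean mask of the frames to predict
    ce = F.cross_entropy(logits[mask], kmeans_ids[mask])                  # discrete offline targets
    reg = F.mse_loss(student_feats[mask], teacher_feats[mask].detach())   # continuous online targets
    return ce + alpha * reg
```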

#6 Comparing Self-Supervised Pre-Training and Semi-Supervised Training for Speech Recognition in Languages with Weak Language Models

Authors: Léa-Marie Lam-Yee-Mui ; Lucas Ondel Yang ; Ondřej Klejch

This paper investigates the potential of improving a hybrid automatic speech recognition model trained on 10 hours of transcribed data with 200 hours of untranscribed data in low-resource languages. First, we compare baseline methods of cross-lingual transfer with MFCC features and features extracted with the multilingual self-supervised model XLSR-53. Subsequently, we compare two approaches that can leverage the untranscribed data: semi-supervised training with LF-MMI and continued self-supervised pre-training of XLSR-53. Our results on well-resourced English broadcast data derived from MGB show that the two methods achieve 18% and 27% relative improvements over the baseline, respectively. On the low-resource South African Soap Opera dataset, the relative improvement with semi-supervised training is only 3% due to the inherently weak language model. However, continued pre-training achieves an 8.6% relative improvement because it does not rely on any external information.

#7 Using Text Injection to Improve Recognition of Personal Identifiers in Speech

Authors: Yochai Blau ; Rohan Agrawal ; Lior Madmony ; Gary Wang ; Andrew Rosenberg ; Zhehuai Chen ; Zorik Gekhman ; Genady Beryozkin ; Parisa Haghani ; Bhuvana Ramabhadran

Accurate recognition of specific categories, such as persons' names, dates, or other identifiers, is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data, including collection, transcription, training, and evaluation, demands special care. One way of ensuring the security and privacy of individuals is to redact or eliminate Personally Identifiable Information (PII) from collection altogether. However, this results in ASR models that tend to have lower recognition accuracy for these categories. We use text injection to improve the recognition of PII categories by including fake textual substitutes of PII in the training data. We demonstrate substantial improvements to the recall of names and dates in medical notes while also improving overall WER. For alphanumeric digit sequences, we show improvements to character error rate and sentence accuracy.
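
A toy illustration of the data-preparation side: build text-only utterances containing randomly generated, fake PII substitutes that a text-injection method can then consume. The templates and generators are illustrative, not taken from the paper.

```python
import random

FIRST_NAMES = ["Alex", "Maria", "Chen", "Fatima"]
LAST_NAMES = ["Rivera", "Okafor", "Novak", "Singh"]
TEMPLATES = [
    "the patient {name} was seen on {date}",
    "please confirm the member id {digits}",
]

def fake_name():
    return f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}"

def fake_date():
    return f"{random.choice(['january', 'march', 'july'])} {random.randint(1, 28)} {random.randint(1990, 2023)}"

def fake_digits(n=8):
    return " ".join(str(random.randint(0, 9)) for _ in range(n))

def synthetic_utterance():
    # Unused placeholders in a template are simply ignored by str.format.
    return random.choice(TEMPLATES).format(name=fake_name(), date=fake_date(),
                                           digits=fake_digits())

print(synthetic_utterance())
```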

#8 Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model

Authors: Tamas Grosz ; Yaroslav Getman ; Ragheb Al-Ghezi ; Aku Rouhe ; Mikko Kurimo

Self-supervised speech models, such as wav2vec2, have become extremely popular in the past few years. Their main appeal is that after pre-training on a large amount of audio, they require only a small amount of supervised fine-tuning data to achieve outstanding results. Despite their immense success, very little is understood about the pre-trained models and how fine-tuning changes them. In this work, we take the first steps towards a better understanding of wav2vec2 systems using model interpretation tools such as visualization and latent embedding clustering. Through our analysis, we gain new insights into the abilities of the pre-trained networks and the effect that fine-tuning has on them. We demonstrate that the clusters learned by the pre-trained model are just as important a factor as the supervised training data distribution in determining the accuracy of the fine-tuned system, which could aid us in selecting the most suitable pre-trained model for the supervised data.
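
A small sketch of the latent-embedding-clustering analysis: extract hidden states from a pre-trained wav2vec2 checkpoint and cluster them with K-means. The checkpoint name, layer index, and cluster count are placeholders; the Finnish models studied in the paper would be substituted here.

```python
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
audio = torch.randn(1, 16000)                       # stand-in for a 1 s, 16 kHz waveform

with torch.no_grad():
    hidden = model(audio, output_hidden_states=True).hidden_states[6]  # one chosen layer

frames = hidden.squeeze(0).numpy()                  # (time, dim) frame embeddings
clusters = KMeans(n_clusters=8, n_init=10).fit_predict(frames)
print(clusters[:20])                                # cluster id per frame
```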

#9 Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech

Authors: Jan Lehečka ; Jan Švec ; Josef V. Psutka ; Pavel Ircing

This paper is a step forward in our effort to make vast oral history archives more accessible to the public and researchers by breaking down the decoding barriers between the knowledge encoded in the spoken testimonies and users who want to search for the information of their interest. We present new Transformer-based monolingual models suitable for speech recognition of oral history archives in English, German, and Czech. Our experiments show that although all-purpose speech recognition systems have recently made tremendous progress, the transcription of oral history archives is still a challenging task for them; our tailored models significantly outperformed larger public multilingual models and scored new state-of-the-art results on all tested datasets. Thanks to a two-phase fine-tuning process, our models are robust and can be used for oral history archives from various domains. We publicly release our models within a public speech recognition service.

#10 Iteratively Improving Speech Recognition and Voice Conversion

Authors: Mayank Kumar Singh ; Naoya Takahashi ; Naoyuki Onoe

Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, for low-resource domains, training a high-quality ASR model remains challenging. In this work, we propose a novel iterative approach to improving both the ASR and VC models. We first train an ASR model, which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvements in both models. Our proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in subjective and objective evaluations in low-resource settings.

#11 LABERT: A Combination of Local Aggregation and Self-Supervised Speech Representation Learning for Detecting Informative Hidden Units in Low-Resource ASR Systems

Authors: Kavan Fatehi ; Ayse Kucukyilmaz

With advances in deep learning methodologies, Automatic Speech Recognition (ASR) systems have achieved impressive results. However, ASR in Low-Resource Environments (LREs) is challenged by a lack of training data for the specific target domain. We propose that data sampling criteria for choosing more informative speech samples can be critical to addressing the training data bottleneck. Our proposed Local Aggregation BERT (LABERT) method for self-supervised speech representation learning fuses an active learning model with an adapted local aggregation metric. Active learning is used to pick informative speech units, whereas the aggregation metric forces the model to move similar data together in the latent space while separating dissimilar instances, in order to detect hidden units in LRE tasks. We evaluate LABERT on two LRE datasets, I-CUBE and UASpeech, to explore the performance of our model on LRE ASR problems.

#12 TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition

Authors: Hongfei Xue ; Qijie Shao ; Peikun Chen ; Pengcheng Guo ; Lei Xie ; Jie Liu

UniSpeech has achieved superior performance in cross-lingual automatic speech recognition (ASR) by explicitly aligning latent representations to phoneme units using multi-task self-supervised learning. While the learned representations transfer well from high-resource to low-resource languages, predicting words directly from these phonetic representations in downstream ASR is challenging. In this paper, we propose TranUSR, a two-stage model comprising a pre-trained UniData2vec and a phoneme-to-word Transcoder. Unlike UniSpeech, UniData2vec replaces the quantized discrete representations with continuous and contextual representations from a teacher model for phonetically aware pre-training. The Transcoder then learns to translate phonemes to words with the aid of extra text, enabling direct word generation. Experiments on Common Voice show that UniData2vec reduces PER by 5.3% compared to UniSpeech, while the Transcoder yields a 14.4% WER reduction compared to grapheme fine-tuning.

#13 Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR

Authors: Zelin Wu ; Tsendsuren Munkhdalai ; Pat Rondon ; Golan Pundak ; Khe Chai Sim ; Christopher Li

ASR systems in real applications must be adapted on the fly to correctly recognize task-specific contextual terms, such as contacts, application names, and media entities. However, it is challenging to achieve scalability, large in-domain quality gains, and minimal out-of-domain quality regressions simultaneously. In this work, we introduce an effective neural biasing architecture called Dual-Mode NAM. Dual-Mode NAM embeds a top-k search process in its attention mechanism in a trainable fashion to perform accurate top-k phrase selection before injecting the corresponding word-piece context into the acoustic encoder. We further propose a controllable mechanism that enables the ASR system to trade off its in-domain and out-of-domain quality at inference time. When evaluated on a large-scale biasing benchmark, the combined techniques improve over a previously proposed method, with average in-domain and out-of-domain WER reductions of up to 53.3% and 12.0% relative, respectively.
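
A bare-bones sketch of the top-k selection idea (the real architecture is a trainable attention mechanism inside the encoder; the query construction and scoring here are assumptions):

```python
# Score every catalogue phrase embedding against an acoustic query, keep only the
# k best, and attend over that reduced set to form the injected context vector.
import torch
import torch.nn.functional as F

def topk_context(query, phrase_embs, k=4):
    # query: (batch, dim) pooled acoustic state; phrase_embs: (num_phrases, dim)
    scores = query @ phrase_embs.T                       # (batch, num_phrases)
    top_scores, top_idx = scores.topk(k, dim=-1)         # keep only the k best phrases
    attn = F.softmax(top_scores, dim=-1)                 # attention over the selected set
    selected = phrase_embs[top_idx]                      # (batch, k, dim)
    return torch.einsum("bk,bkd->bd", attn, selected)    # context vector to inject

ctx = topk_context(torch.randn(2, 256), torch.randn(1000, 256))
print(ctx.shape)    # torch.Size([2, 256])
```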

#14 GhostRNN: Reducing State Redundancy in RNN with Cheap Operations

Authors: Hang Zhou ; Xiaoxu Zheng ; Yunhe Wang ; Michael Bi Mi ; Deyi Xiong ; Kai Han

Recurrent neural networks (RNNs), which are capable of modeling long-distance dependencies, are widely used in various speech tasks, e.g., keyword spotting (KWS) and speech enhancement (SE). Due to the limited power and memory of low-resource devices, efficient RNN models are urgently required for real-world applications. In this paper, we propose an efficient RNN architecture, GhostRNN, which reduces hidden state redundancy with cheap operations. In particular, we observe that some dimensions of the hidden states are similar to others in trained RNN models, suggesting that redundancy exists in such models. To reduce this redundancy and hence the computational cost, we propose to first generate a few intrinsic states and then apply cheap operations to produce ghost states based on the intrinsic states. Experiments on KWS and SE tasks demonstrate that the proposed GhostRNN significantly reduces memory usage (~40%) and computation cost while maintaining similar performance. Code will be available at https://gitee.com/mindspore/models/tree/master/research/audio/ghostrnn
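
One possible reading of the intrinsic/ghost split, sketched as a GRU cell; the choice of cheap operation (a bias-free linear map) and the 50% ratio are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GhostGRUCell(nn.Module):
    """Compute only the `intrinsic` state dimensions with a GRU cell, then derive
    the remaining 'ghost' dimensions with a cheap operation."""
    def __init__(self, input_size, hidden_size, ratio=0.5):
        super().__init__()
        self.intrinsic = int(hidden_size * ratio)
        self.ghost = hidden_size - self.intrinsic
        self.cell = nn.GRUCell(input_size, self.intrinsic)
        self.cheap = nn.Linear(self.intrinsic, self.ghost, bias=False)

    def forward(self, x, h_intrinsic):
        h_intrinsic = self.cell(x, h_intrinsic)
        h_ghost = self.cheap(h_intrinsic)          # cheap op generating the ghost states
        return torch.cat([h_intrinsic, h_ghost], dim=-1), h_intrinsic

cell = GhostGRUCell(input_size=40, hidden_size=128)
h = torch.zeros(2, cell.intrinsic)
out, h = cell(torch.randn(2, 40), h)
print(out.shape)   # torch.Size([2, 128])
```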

#15 Task-Agnostic Structured Pruning of Speech Representation Models

Authors: Haoyu Wang ; Siyuan Wang ; Wei-Qiang Zhang ; Suo Hongbin ; Yulong Wan

Self-supervised pre-trained models such as Wav2vec2, Hubert, and WavLM have been shown to significantly improve many speech tasks. However, their large memory footprint and strong computational requirements hinder their industrial applicability. Structured pruning is a hardware-friendly model compression technique but usually results in a larger loss of accuracy. In this paper, we propose a fine-grained attention head pruning method to compensate for the performance degradation. In addition, we introduce a straight-through estimator into the L0 regularization to further accelerate the pruned model. Experiments on the SUPERB benchmark show that our model can achieve comparable performance to the dense model on multiple tasks and outperforms the Wav2vec 2.0 base model on average, with 72% fewer parameters and 2x faster inference.
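
A compact sketch of a straight-through head gate of the kind described above; the gate parameterization and threshold are simplifications of hard-concrete-style L0 gates, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class STEHeadGate(nn.Module):
    """Hard gate per attention head: the forward pass uses a binary mask, while the
    straight-through estimator passes gradients to the underlying logits."""
    def __init__(self, num_heads):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self, head_outputs):               # head_outputs: (batch, heads, time, dim)
        probs = torch.sigmoid(self.logits)
        hard = (probs > 0.5).float()
        gate = hard + probs - probs.detach()       # straight-through trick
        return head_outputs * gate.view(1, -1, 1, 1)

    def l0_penalty(self):
        return torch.sigmoid(self.logits).sum()    # expected number of active heads
```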

#16 Factual Consistency Oriented Speech Recognition

Authors: Naoyuki Kanda ; Takuya Yoshioka ; Yang Liu

This paper presents a novel optimization framework for automatic speech recognition (ASR) with the aim of reducing hallucinations produced by an ASR model. The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the factual consistency score is computed by a separately trained estimator. Experimental results using the AMI meeting corpus and the VoxPopuli corpus show that the ASR model trained with the proposed framework generates ASR hypotheses that have significantly higher consistency scores with ground-truth transcriptions while maintaining the word error rates close to those of cross entropy-trained ASR models. Furthermore, it is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries generated by a large language model.
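
One way to write such an objective, as a hedged sketch (the paper's exact formulation, e.g. how hypotheses are sampled and normalized, may differ):

```python
# Maximize the expected consistency score over an N-best list, weighting each
# hypothesis by its normalized model probability.
import torch

def expected_consistency_loss(nbest_logprobs, consistency_scores):
    # nbest_logprobs:     (N,) hypothesis log-probabilities from the ASR model
    # consistency_scores: (N,) scores from a separately trained estimator comparing
    #                     each hypothesis with the ground-truth transcription
    weights = torch.softmax(nbest_logprobs, dim=0)
    return -(weights * consistency_scores).sum()
```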

#17 Multi-Head State Space Model for Speech Recognition

Authors: Yassir Fathullah ; Chunyang Wu ; Yuan Shangguan ; Junteng Jia ; Wenhan Xiong ; Jay Mahadeokar ; Chunxi Liu ; Yangyang Shi ; Ozlem Kalinli ; Mike Seltzer ; Mark J. F. Gales

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development sets and 1.91%/4.36% on the test sets without using an external language model.

#18 Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search

Authors: Yingying Gao ; Shilei Zhang ; Zihao Cui ; Chao Deng ; Junlan Feng

Cascading multiple pre-trained models is an effective way to compose an end-to-end system. However, fine-tuning the full cascaded model is parameter- and memory-inefficient, and our observations reveal that only applying adapter modules to the cascaded model cannot achieve performance comparable to fine-tuning. We propose an automatic and effective adaptive learning method to optimize end-to-end cascaded multi-task models based on a Neural Architecture Search (NAS) framework. The candidate adaptation operations for each specific module are keeping it frozen, inserting an adapter, and fine-tuning. We further add a penalty term to the loss to constrain the learned structure, taking the number of trainable parameters into account. The penalty term successfully restricts the searched architecture, and the proposed approach finds tuning schemes similar to hand-crafted ones, compressing the trainable parameters to 8.7% of those of full fine-tuning on SLURP with even better performance.
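
A DARTS-style sketch of this search space: each module mixes three candidate adaptation ops (keep frozen, add an adapter, fully fine-tune) under learnable architecture weights, with a penalty on the expected number of trainable parameters. The mixing scheme, module shapes, and penalty weight are assumptions.

```python
import copy
import torch
import torch.nn as nn

class AdaptiveChoice(nn.Module):
    def __init__(self, base_module, adapter):
        super().__init__()
        self.tuned = copy.deepcopy(base_module)         # candidate 3: full fine-tuning
        self.frozen = base_module                       # candidate 1: keep frozen
        for p in self.frozen.parameters():
            p.requires_grad = False
        self.adapter = adapter                          # candidate 2: frozen + adapter
        self.alpha = nn.Parameter(torch.zeros(3))       # architecture weights
        self.costs = torch.tensor([
            0.0,                                        # frozen: no trainable parameters
            float(sum(p.numel() for p in adapter.parameters())),
            float(sum(p.numel() for p in self.tuned.parameters())),
        ])

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        frozen_out = self.frozen(x)
        outs = [frozen_out, frozen_out + self.adapter(frozen_out), self.tuned(x)]
        return sum(wi * o for wi, o in zip(w, outs))

    def param_penalty(self):
        # Expected number of trainable parameters under the current architecture weights.
        return (torch.softmax(self.alpha, dim=0) * self.costs).sum()

layer = AdaptiveChoice(nn.Linear(64, 64),
                       adapter=nn.Sequential(nn.Linear(64, 8), nn.ReLU(), nn.Linear(8, 64)))
y = layer(torch.randn(2, 64))
loss = y.pow(2).mean() + 1e-6 * layer.param_penalty()
loss.backward()
```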

#19 Probing Self-supervised Speech Models for Phonetic and Phonemic Information: A Case Study in Aspiration

Authors: Kinan Martin ; Jon Gauthier ; Canaan Breiss ; Roger Levy

Textless self-supervised speech models have grown in capabilities in recent years, but the nature of the linguistic information they encode has not yet been thoroughly examined. We evaluate the extent to which these models' learned representations align with basic representational distinctions made by humans, focusing on a set of phonetic (low-level) and phonemic (more abstract) contrasts instantiated in word-initial stops. We find that robust representations of both phonetic and phonemic distinctions emerge in early layers of these models' architectures, and are preserved in the principal components of deeper layer representations. Our analyses suggest two sources for this success: some distinctions can only be explained by the models' optimization on speech data, while others can be attributed to the models' high-dimensional architectures. Our findings show that speech-trained HuBERT derives a low-noise and low-dimensional subspace corresponding to abstract phonological distinctions.
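
The layer-wise probing setup can be approximated with a simple linear probe, e.g. as below; the random features are a stand-in, and in practice they would be frame or segment embeddings from each HuBERT layer labeled for a contrast such as aspirated vs. unaspirated stops.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layer(layer_embeddings, labels):
    # layer_embeddings: (num_tokens, dim) features from one model layer
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, layer_embeddings, labels, cv=5).mean()

rng = np.random.default_rng(0)
fake_feats = rng.normal(size=(200, 768))          # stand-in for one layer's outputs
fake_labels = rng.integers(0, 2, size=200)        # e.g., aspirated vs. unaspirated
print(probe_layer(fake_feats, fake_labels))       # compare this accuracy across layers
```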

#20 Selective Biasing with Trie-based Contextual Adapters for Personalised Speech Recognition using Neural Transducers

Authors: Philip Harding ; Sibo Tong ; Simon Wiesler

Neural transducer ASR models achieve state-of-the-art accuracy on many tasks; however, rare word recognition poses a particular challenge, as models often fail to recognise words that occur rarely, or not at all, in the training data. Methods of contextual biasing, where models are dynamically adapted to bias their outputs towards a given list of relevant words and phrases, have been shown to be effective at alleviating this issue. While such methods are effective at improving rare word recognition, over-biasing can lead to degradation on common words. In this work we propose several extensions to a recently proposed trie-based method of contextual biasing. We show how the method's rare word recognition can be improved, especially for very large catalogues, by introducing a simple normalisation term; how the method can be trained as an adapter module; and how selective biasing can be applied to practically eliminate over-biasing on common words.
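
To make the normalisation idea concrete, here is a shallow-fusion-style sketch of trie-based biasing with bonus cancellation; the paper's method is an adapter trained inside the model, so this is an analogy rather than its implementation.

```python
class BiasTrie:
    """Tokens that extend a catalogue phrase earn a bonus; abandoning a partial
    match subtracts the accumulated bonus, so common words are not over-biased."""
    END = object()                                   # marks a completed phrase

    def __init__(self, phrases, bonus=1.0):
        self.root, self.bonus = {}, bonus
        for phrase in phrases:                       # each phrase: list of token ids
            node = self.root
            for tok in phrase:
                node = node.setdefault(tok, {})
            node[self.END] = True

    def score(self, state, token):
        """Return (score_delta, new_state); state is (trie_node, accumulated_bonus)."""
        node, acc = state if state is not None else (self.root, 0.0)
        if token in node:
            return self.bonus, (node[token], acc + self.bonus)
        if self.END in node:                         # full phrase matched: keep the bonus
            return 0.0, (self.root, 0.0)
        return -acc, (self.root, 0.0)                # partial match abandoned: cancel it

trie = BiasTrie(phrases=[[7, 12, 5]], bonus=2.0)
state = None
for tok in [7, 12, 9]:                               # "7 12" matches, "9" breaks the match
    delta, state = trie.score(state, tok)
    print(tok, delta)                                # two bonuses, then a -4.0 correction
```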

#21 Diacritic Recognition Performance in Arabic ASR

Authors: Hanan Aldarmaki ; Ahmad Ghannam

In Arabic text, most vowels are encoded in the form of diacritics that are often omitted, so most speech corpora and ASR models are undiacritized. Text-based diacritization has previously been used to preprocess the input or post-process ASR hypotheses. It is generally believed that input diacritization degrades ASR quality, but no systematic evaluation of ASR diacritization performance has been conducted to date. We experimentally clarify whether input diacritization indeed degrades ASR quality and compare ASR diacritization with text-based diacritization. We fine-tune pre-trained ASR models on transcribed speech with different diacritization conditions: manual, automatic, and no diacritization. We isolate diacritic recognition performance from the overall ASR performance using coverage and precision metrics. We find that ASR diacritization significantly outperforms text-based diacritization, particularly when the ASR model is fine-tuned with manually diacritized transcripts.

#22 Personalization for BERT-based Discriminative Speech Recognition Rescoring

Authors: Jari Kolehmainen ; Yile Gu ; Aditya Gourav ; Prashanth Gurunath Shivakumar ; Ankur Gandhe ; Ariya Rastrow ; Ivan Bulyko

Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant supplemented with personalized named entities to compare these approaches. On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10%, against a neural rescoring baseline. We also show that on this test set, natural language prompts can improve word error rate by 7% without any training and with a marginal loss in generalization. Overall, gazetteers were found to perform the best with a 10% improvement in word error rate (WER), while also improving WER on a general test set by 1%.

#23 On the N-gram Approximation of Pre-trained Language Models

Authors: Aravind Krishnan ; Jesujoba O. Alabi ; Dietrich Klakow

Large pre-trained language models (PLMs) have shown remarkable performance across various natural language understanding (NLU) tasks, particularly in low-resource settings. Nevertheless, their potential in Automatic Speech Recognition (ASR) remains largely unexplored. This study investigates the potential usage of PLMs for language modelling in ASR. We compare the application of large-scale text sampling and probability conversion for approximating GPT-2 into an n-gram model. Furthermore, we introduce a vocabulary-restricted decoding method for random sampling, and evaluate the effects of domain difficulty and data size on the usability of generated text. Our findings across eight domain-specific corpora support the use of sampling-based approximation and show that interpolating with a large sampled corpus improves test perplexity over a baseline trigram by 15%. Our vocabulary-restricted decoding method pushes this improvement further by 5% in domain-specific settings.
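
The sampling-based approximation can be pictured as: draw a large synthetic corpus from GPT-2, then estimate an n-gram model from it. The sketch below uses the public Hugging Face GPT-2 checkpoint and raw trigram counts; a real setup would sample far more text and train the n-gram with a dedicated toolkit.

```python
from collections import Counter
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

samples = []
with torch.no_grad():
    for _ in range(10):                                    # scale up for a real corpus
        out = model.generate(
            tok("The", return_tensors="pt").input_ids,
            do_sample=True, top_k=50, max_length=64,
            pad_token_id=tok.eos_token_id,
        )
        samples.append(tok.decode(out[0], skip_special_tokens=True))

trigrams = Counter()
for text in samples:
    words = text.split()
    trigrams.update(zip(words, words[1:], words[2:]))      # crude trigram statistics
print(trigrams.most_common(5))
```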

#24 Record Deduplication for Entity Distribution Modeling in ASR Transcripts

Authors: Tianyu Huang ; Chung Hoon Hong ; Carl Wivagg ; Kanna Shimizu

Voice digital assistants must keep up with trending search queries. We rely on a speech recognition model using contextual biasing with a rapidly updated set of entities, instead of frequent model retraining, to keep up with trends. There are several challenges with this approach: (1) the entity set must be frequently reconstructed, (2) the entity set is of limited size due to latency and accuracy trade-offs, and (3) finding the true entity distribution for biasing is complicated by ASR misrecognition. We address these challenges and define an entity set by modeling customers' true requested entity distribution from ASR output in production using record deduplication, a technique from the field of entity resolution. Record deduplication resolves or deduplicates coreferences, including misrecognitions, of the same latent entity. Our method successfully retrieves 95% of misrecognized entities and when used for contextual biasing shows an estimated 5% relative word error rate reduction.
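
A greatly simplified sketch of the deduplication step: greedily fold ASR-transcribed entity strings into a canonical form when their string similarity exceeds a threshold, so misrecognitions of the same latent entity are counted together. The similarity measure, threshold, and example counts are placeholders for the paper's record-deduplication model.

```python
from difflib import SequenceMatcher
from collections import Counter

def deduplicate(entity_counts, threshold=0.85):
    canonical = Counter()
    for entity, count in entity_counts.most_common():      # most frequent form first
        for canon in canonical:
            if SequenceMatcher(None, entity, canon).ratio() >= threshold:
                canonical[canon] += count                   # fold into an existing entity
                break
        else:
            canonical[entity] += count                      # start a new canonical entity
    return canonical

counts = Counter({"taylor swift": 900, "tailor swift": 60, "taylor swifts": 25})
print(deduplicate(counts))   # misrecognitions fold into the dominant form
```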

#25 Learning When to Trust Which Teacher for Weakly Supervised ASR

Authors: Aakriti Agrawal ; Milind Rao ; Anit Kumar Sahu ; Gopinath Chennupati ; Andreas Stolcke

Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature, since their architecture may not be known or their training cadence may differ from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by the expert teachers. In this paper, we exploit supervision from multiple domain experts in training student ASR models. This training strategy is especially useful in scenarios where few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio and then trains the student model in an unsupervised setting. We show the efficacy of our approach using the LibriSpeech and LibriLight benchmarks and find an improvement of 4% to 25% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
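
A minimal sketch of a Smart-Weighter-style gate (the input features, architecture, and training signal are assumptions): a small network maps an utterance embedding to weights over the expert teachers, and the student is trained on the pseudo-label of the selected or weighted expert.

```python
import torch
import torch.nn as nn

class TeacherSelector(nn.Module):
    def __init__(self, embed_dim, num_teachers):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_teachers))

    def forward(self, utt_embedding):
        return torch.softmax(self.net(utt_embedding), dim=-1)  # weights over experts

selector = TeacherSelector(embed_dim=256, num_teachers=3)
weights = selector(torch.randn(1, 256))          # per-utterance expert weights
best = weights.argmax(dim=-1)                    # which teacher's pseudo-label to trust
print(weights, best)
```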